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With the number of sequenced genomes now over one hundred, and the availability of rough 
functional annotations for a substantial proportion of their genes, it has become possible to study 
the statistics of gene content across genomes. Here I show that, for many high-level functional 
categories, the number of genes in the category scales as a power-law in the total number of genes in 
the genome. The occurrence of such scaling laws can be explained with a simple theoretical model, 
and this model suggests that the exponents of the observed scaling laws correspond to universal 
constants of the evolutionary process. I discuss some consequences of these scaling laws for our 
understanding of organism design. 

What fraction of a genome's gene content is allotted to different functional tasks, and how does 
this depend on the complexity of the organism? Until recently, there was simply no data to address 
such questions in a quantitative way. Presently, however, there are more than 100 sequenced genomes 
(7 1 in public databases, and protein-family classification algorithms allow functional annotations for a 
considerable fraction of the genes in each genome. Thus, it has become possible to analyze the statistics 
of functional gene-content across different genomes, and here I present results on the dependency of the 
number of genes in different high-level categories on the total number of genes in the genome. 

Evaluating the functional gene-content of genomes 

To estimate the number of genes in different functional categories each genome has to be functionally 
annotated. The main results presented in this paper were obtained using the Interpro [l] annotations 
of sequenced genomes available from the European Bioinformatics Institute [2|. To map the Interpro 
annotations to high-level functional categories I used the Gene Ontology "biological process" hierarchy 
l3l and a mapping from Interpro entries to GO-categories both of which can be obtained from the gene 
ontology website |4 |. For each GO category I collect all Interpro entries that map to it or to one of 
its descendants in the "biological process" hierarchy. To minimize the effects of potential biases in the 
mappings from Interpro to GO I only use high-level functional categories that are represented by at least 
50 different Interpro entries. This leaves 44 high-level GO categories. 

A gene with multiple hits to Interpro entries that are associated with a GO category has a higher 
probability to belong to that category than a gene with only a single hit. To take this information into 
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account, I assume that if a gene i has independent hits to Interpro entries associated with GO category 
c, than with probabihty 1 — exp(— /?nf ) the gene belongs to the GO category. The results are generally 
insensitive to the value of /3 (see the appendix), and I used /? = 3 for the results shown below. The 
estimated number of genes ric in a genome for a given GO category is then the sum f^c = 1 ~ 
exp(— over all genes i in the genome. 
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Figure 1: The number of transcription regulatory genes (red), metabolic genes (blue), and cell cycle 
related genes (green) as a function of the total number of genes in the genome for bacteria (A) and 
eukaryotes (B). Both axes are shown on logarithmic scales. Each dot corresponds to a genome. The 
straight lines are power-law fits. Archaea are not shown because the range of genome sizes in archaea is 
too small for meaningful fits. For completeness the archaeal results are shown in figure|3]in the appendix. 

Figure[2shows the results for the categories of transcription regulatory genes (red), metabolic genes 
(blue), and cell-cycle related genes (green) for bacteria (panel A) and eukaryotes (panel B). Remarkably, 
for each functional category shown, we find an approximately power-law relationship (solid line fits)^ 
That is, if ric is the number of genes in the category, and g is the total number of genes in the genome, 
we observe laws of the form: ric = ^g'^, where both A and the exponent a depend on the category under 
study. In fact, such power-laws are observed for most of the 44 high-level categories, and the estimated 
values of the exponents for several functional categories are shown in Table [2 Note that a potential 
source of bias in estimating these exponents is the occurrence of multiple genomes from the same or 
closely-related species in the data. As shown in the appendix, removing this "redundancy" from the data 
does not alter the observed exponents. 

The power-law fits and 99% posterior probability estimates for their exponents were obtained using 
a Bayesian procedure described in the appendix. To assess the quality of the fits, I measured, for each fit, 
the fraction of the variance in the data that is explained by the fit. In bacteria, 26 out of 44 GO categories 
have more than 95% of the variance explained by the fit, 38 categories have more than 90% explained. In 
eukaryotes, 26 categories have more than 95% percent explained by the fit and 32 have more than 90% 
explained by the fit. However, with total gene number varying over less than two decades in bacteria, 

'These power-laws as a function of total gene number should be distinguished from the power-law distributions of gene- 
family sizes and other genomic attributes within a single genome 
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Estimated Exponents 


Category 


Bacteria 


Eukaryotes 


Transcription regulation 


1.87+/-0.13 


1.26+/-0.1 


Metabolism 


1.01+/-0.06 


1.01+/-0.08 


Cell cyclo 


0.47+/-U.U8 


(l.79^/-().16 


Signal transduction 


1.72+/-0.18 


1.48+/-0.39 


DNA repair 


0.64+/-0.08 


0.83+/-0.31 


DMA replication 


0.43+/-0.08 


0.72+/-0.23 


Protein biosynthesis 


0.13+/-0.02 


0.41+/-0.15 


Protein degradation 


0.97+/-0.09 


0.90+/-0.11 


Ion transport 


1.42+/-0.28 


1 .43+/-0.20 


Catabolism 


0.88+/-0.07 


0.92+/-0.08 


Carbohydrate metabolism 


1.01+/-0.1 1 


1 .36+/-0.36 


Two-component systems 


2.07+/-0.21 


NA 


Cell communication 


1.81+/-0.19 


1.58+/-0.34 


Defense response 


NA 


3.35+/-1.41 



Table 1: Estimates for the exponents of a selection of functional categories. The first number gives the 
maximum likelihood estimate of the exponent while the second indicates the boundaries of the 99% 
posterior probability interval. 

and the small number of data points in eukaryotes, one may wonder how one can claim that power-laws 
have been "observed". First, the fact that I fit the data to power-laws should not be mistaken for a claim 
that the data can only be described by power-law functions. I only claim that the power-law is by far the 
simplest functional form that fits almost all the observed data. Second, when scatter plots such as those 
shown in Fig. ^are plotted on linear as opposed to logarithmic scales, it is clear even by eye that the 
fluctuations in the number of genes in a category scale with the total number of genes in the genome. 
That is, the fluctuations in the data suggest that logarithmic scales are the "natural" scales for this data. 
This is further supported by the simple evolutionary model presented below. 

One may also wonder to what extent the results are sensitive to the specific functional annotation 
procedure. I performed a variety of tests to assess the robustness of the results, i.e. the observed power- 
law scaling and the values of the exponents, to changes in the annotation methodology (see appendix). 
These involve using entirely independent annotations based on Clusters of Orthologous Groups of pro- 
teins (COG) fTOl, and a simple (crude) annotation scheme based on keyword searches of protein tables 
for sequenced microbial genomes from the NCBI website |8|. As shown in the appendix, the observed 
power-law scaling, and the values of the exponents are generally insensitive to these and other changes 
in annotation methodology. It has to be noted, however, that all currently available annotation schemes, 
including the ones used here, predict function from sequence homology and thus at some level assume 
that functional homology can be inferred from sequence homology. The results reported here thus also 
depend on this assumption. 
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Observed Exponents 



Some functional categories, such as the large category of metabolic genes, occupy a roughly constant 
fraction of a genome's gene-content, as evidenced by their exponent of 1. However, many categories 
show significant deviations from this "trivial" exponent. Genes related to cell-cycle or protein biosyn- 
thesis have exponents significantly below 1, whereas for transcription factors (TFs) the exponent is sig- 
nificantly above 1. These trends are strongest in bacteria. There the exponent for TFs is almost 2, 
implying that as the number of genes in the genome doubles, the number of TFs quadruples. This has 
some interesting implications for "regulatory design" in bacteria. It implies that the number of TFs per 
gene grows in proportion to the size of the genome (see [9| for a similar observation). This in turn im- 
plies that, in larger genomes, each gene must be regulated by a larger number of TFs and/or each TF 
must be regulating a smaller set of genes. An exponent of 2 is also observed for two-component sys- 
tems^, which are the primary means by which bacteria sense their environment. This suggests that the 
relative increase in transcription regulators in more complex bacteria is accompanied by an equal relative 
increase in sensory systems. 

The difficulties with gene prediction and annotation in eukaryotes, the small number of available 
genomes, and our lack of understanding of the role of alternative splicing across eukaryotic genomes 
make it premature to draw many conclusions from Fig. IB. However, the main trends from Fig. lA 
are reproduced: the super linear scaling of TFs, the sub linear scaling of cell-cycle genes, and the small 
exponents for DNA replication and protein biosynthesis genes. 

The observed super linear scaling of TFs also has implications for our understanding of "combi- 
natorial control" in transcription regulation. It is well established that in complex organisms, different 
TFs combine into complexes to affect transcription control. Therefore, a relatively small number of TFs 
can implement a combinatorially large number of different transcription regulatory "states", which may 
correspond to particular external environments, developmental stages, tissues, combinations of exter- 
nal stimuli, etcetera. Each such regulatory state will be associated with a unique set of genes that are 
expressed in that state. If the number of such regulatory states were proportional to the total number 
of genes, then the number of TFs would increase much more slowly than the total number of genes. 
However, the scaling results show that, instead, the number of TFs increases more rapidly than the total 
number of genes. This thus implies that the number of regulatory states is also "combinatorial" in the 
total number of genes: a relatively small number of genes is used in different combinations to implement 
combinatorially many regulatory states. 

The picture that emerges is not of TFs being used in different combinations to implement the regu- 
latory needs of individual genes. But rather that, as one moves from simple to more complex organisms, 
the number of regulatory states grows so much faster than the total number of genes that, even with com- 
binatorial control of transcription, the number of TFs grows much faster than the total number genes. 

Evolutionary model 

One of course wonders about the origins of these scaling laws in genome organization, and I like to 
present some speculations in this regard. Assume that most changes in the number of genes ric in a 
functional category c are caused by duplications and deletions. Then, nc{t) generally evolves according 

^Note that two-component systems were not in the hst of 44 high-level categories. 
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to the equation 

= m) - 6it))n,it) = p{t)n,{t), (1) 

at 

with I3{t) and 5{t) respectively the duphcation- and deletion-rate of the genes in this category at time 
t in the evolutionary history of the genome. For simplicity of notation I have introduced the difference 
of duplication and deletion rates p{t), which can be thought of as an "effective" duplication rate. This 
rate p{t) is presumably proportional to the difference between the average probability that selection will 
favor fixation of a duplicated gene from this category and the average probability that selection will favor 
deletion of a gene from this category. Similarly, the total number of genes g{t) obeys the equation 

=^f^ = 7(ttoW, (2) 

with 7(t) the overall effective rate of gene duplication in the genome at time t in its evolutionary history. 
When we solve for ric as a function of g we find 

n, = \g^p)l^'<), (3) 

where (p) and (7) are the mean effective duphcation rates of genes in category c and the entire genome 
respectively, averaged over the evolutionary history of the genome, and A is a constant that depends on 
the boundary conditions. In order for all bacterial genomes to obey the same functional relation, the 
constant A and the ratios (p) / (7) have to be the same for all bacterial evolutionary lineages. Since all life 
shares a common ancestor, the boundary conditions for equations Q and ^ are trivially the same for all 
bacterial lineages, implying that the constant A is indeed the same for all bacterial lineages. In summary, 
simply assuming that changes in gene-number occur mostly through duplications and deletions implies 
our observed power-law scaling if the. ratios {p) / (7) are the same for all evolutionary lineages. 

I thus propose that the explanation for the observed scaling-laws is that the ratios (p) / (7) are indeed 
the same for all bacterial lineages, i.e. these ratios of average duplication rates are "universal constants" 
of genome evolution. For instance, the exponent 2 for TFs in bacteria indicates that, in all bacterial 
hneages, evolution selects duplicated TFs twice as frequently as duplicated genes in general. It seems 
hkely that such universal constants are intimately connected to fundamental design principles of the 
evolutionary process. It is tempting to become even more speculative in this regard, and suggest that 
this factor of 2 in duplication rate is related to switch-like function of transcription factors: with each 
addition of a transcription factor the number of transcription-regulatory states of the cell doubles. It is 
not entirely implausible to assume that with twice the number of internal states available, the probabihty 
of such a duplication being fixed in evolution is twice as large as the probability of fixing a duplicated 
gene that does not double the number of internal states of the cell. 

Finally, as table [2 shows, there is still substantial uncertainty about the exact numerical values of 
the exponents given the current data, and many more genomes are needed to estimate these values more 
accurately. A survey of the NCBI genome database [7J shows that the number of sequenced genomes is 
increasing exponentially, with a doubling time of about 16 months (Fig. |2li. This suggests that within 
a few years thousands of genomes will become available. With such an increase in available data it 
will become possible to look at much more fine-grained gene content statistics than the ones presented 
here. One can for instance imagine going beyond looking at single functional categories at a time, and 
investigate if there are correlations in the variations of gene number in more fine-grained functional 
categories. I believe that such investigations have the potential to teach us much about the functional 
design principles of the evolutionary process. 
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Growth of the number of sequenced genomes 
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Figure 2: The number of fully-sequenced genomes in the NCBI database Q as a function of time. The 
vertical axis is shown on a logarithmic scale. The straight line is a least squares fit to an exponential 
function: n{t) = 2(*-i994)/i.3_ 
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Appendix 



Results for Archaea 

Figure|3lshows the number of transcription regulatory genes (red), metabolic genes (blue), and cell-cycle 
related genes (green) as a function of the total number of genes in archaeal genomes. Note that the size 
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Figure 3: The number of transcription regulatory genes (red), metabolic genes (blue), and cell cycle 
related genes (green) as a function of the total number of genes in archaea. Both axes are shown on 
logarithmic scales. Each dot corresponds to a genome. The straight lines are power-law fits. 

of the largest archaeal genome differs from that of the smallest by only a factor of approximately 3. 
Consequently, there is large uncertainty regarding the values of the exponents for archaea. The maxi- 
mum likelihood values and 99% posterior probability intervals are 2.1 and [1.3 — 5.7] for transcription 
regulatory genes, 0.81 and [0.44 — 1.48] for metabolic genes, and 0.83 and [0.54 — 1.35] for cell-cycle 
related genes. 

Power-law Fitting 

The power-law fits in figure Q were obtained using a Bayesian straight-line fitting procedure. For each 
GO-category, I log-transform the data such that each data point {xi,yi) corresponds to the logarithm of 
the estimated total number of genes, and the logarithm of the estimated number of genes in the category 
respectively. I assume that these transformed data where drawn from a linear model 

y = a{x + e) + A + T/, 

where r/ and e are "noise" terms in the x- and y-coordinates respectively and A in an unknown off-set. I 
assume that the joint-distribution P{r], e) is a two-dimensional Gaussian with means zero and unknown 
variances and co-variance. That is, I use scale invariant priors for the variances, and will integrate these 
nuisance variables out of the likelihood. A uniform prior is used for the location parameter A, and for the 
slope a I will use a rotation invariant prior: 
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The use of these priors guarantees that the results are invariant under all shifts and rotations of the plane. 
Integrating over all variables except for a, we obtain for the posterior P{a\D) given the data D: 



P{a\D)da = C 



{a" + l)("-3)/2da 




, -2aSj,,, + Syy)("-i)/2 



where n is the number of genomes in the data, Sxx is the variance in x-values (logarithms of the total 
gene numbers), Syy the variance in y-values (logarithms of the number of genes in the category), Syx is 
the CO- variance, and C is a normalizing constant. The values of the exponents reported in tableware the 
values of a that maximize P{a\D), and the boundaries of the 99% posterior probability interval around 
it. 

For each fit, I also measure what fraction of the variance in the data is explained by the fit. That is, I 
compare the average distance d of the points in the plane to their center of mass with the average distance 
di of the points to the fitted line and define the fraction q = 1 — di/da.?, the variance in the data explained 
by the fit. 

Robustness of the results 

I first checked that the total amount of available annotation information is not itself dependent on genome 
size. If the total amount of available annotation information were to vary with the number of genes in 
the genome, this could lead to biases in the estimated exponents. To exclude this possibility, I counted 
the total number of genes with any Interpro hits in each genome and found that, for both bacteria and 
eukaryotes, the fraction of genes in the genome with Interpro hits is about 2/3, independent of the total 
number of genes in the genome. Consistent with this observation, when one fits power-laws to the number 
of genes ric in a GO-category c as a function of the total number of annotated genes in the genome, one 
finds exponents that are very close to those founds for ric as a function of the total number of genes in 
the genome. 

Second, I tested that the results are insensitive to the value of the parameter (3. The default value 
/3 = 3 gives a gene with a single Interpro hit a probability of 1 — e^'^ 0.95 to belong to the category. 
This is reasonable because Interpro is designed to only report statistically significant hits. To assess the 
effect of changing /5, results for /? = 1 were generated. For bacteria the change in fitted exponent is 
less than 5% for 26 of 44 categories, and less than 10% for 39 categories. For eukaryotes the exponent 
changes by less than 5% for all but 3 categories. In all cases, the change in fitted exponent is significantly 
smaller than the 99% posterior probability intervals associated with the exponents of the fits. 

Third, I tested the robustness of the results against removal of potential "redundancies" in the data. 
For bacteria there are several examples where multiple genomes of the same species or genomes of very 
closely related species occur in the data, and one might suspect that these may bias the results in some 
way. To this end, I parsed the names of all bacterial species into a general and a specific part, e.g. for 
Escherichia Coli Escherichia is the general part and coli the specific part, for Listeria innocua Listeria 
is the general part and innocua the specific part, etcetera. Groups of genomes with the same general part 
were then collected together and for each group the gene numbers were replaced with a single average 
of total gene number and average gene counts in each of the functional categories. This reduces the 
size of the data set by about a third. Power-law fitting was then applied to this reduced set and the 
fitted exponents were compared with those of the full data set. The maximum likelihood exponent was 
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changed by less than 5% in 31 out of 44 categories. The largest observed change was an 18% change in 
the exponent. All changes to the exponents were well within the 99% posterior probability intervals. 

More importantly, the results could depend on the use of Interpro, the Gene Ontology, and the map- 
ping of Interpro entries to GO categories. To test the robustness of the results to biases inherent in 
Interpro annotation and/or the mapping from Interpro to Gene Ontology, I analyzed selected functional 
categories using two other annotation schemes. 

The first is based on Clusters of Orthologous Groups of proteins (COG) 1 10 1 annotation of 63 bacte- 
rial genomes that can be obtained from 
|ftp://ftp.ncbi.nih.gov/pub/COG/COG] 

In this data set, proteins of the 63 bacterial genomes are assigned to COGs, and the COGs have been 
assigned to functional categories. I used these assignments to count the number of genes in different 
functional categories according to the COG annotation scheme. A comparison of the exponents for 
COG functional categories and the exponents for the closest GO categories obtained using the Interpro 
annotations are shown in table |2 

The second alternative annotation scheme I used is based on simple keyword searches of protein 
tables for fully-sequenced bacterial genomes available from the NCBI ftp site: 
ftp.ncbi. nlm. nih. go v/genomes/B acteria. 

Removing genomes for which little or no annotation exists, this leaves protein tables for 90 bacterial 
genomes. Each protein in these protein tables is annotated with a short description line. The number 
of genes in different functional categories was counted by searching each description line for hits to a 
set of keywords that characterize the category. For instance, I chose the keywords "ribosom", "trans- 
lation", and "tRNA" for the category "protein biosynthesis", and the gene is counted as belonging to 
this category if any of these keywords occurs in its description line. For the other categories I used the 
following keywords: "transcription" for the category "transcription"; "transport", "channel", "efflux", 
"pump", "porin", "export", "permease", "symport", "transloca", and "PTS" for the category "transport"; 
all combinations "X Y" with X being one of "ion", "sodium", "calcium", "potassium", "magnesium", 
and "manganese", and Y being one of "channel", "efflux", "transport", and "uptake" for the category 
"ion transport"; 'protease" and "peptidase" for the category "protein degradation"; "kinase" for the cat- 
egory "kinase"; and finally the phrases 'DNA polymerase", "topoisomerase", "DNA gyrase", "DNA 
ligase", "replication", "helicase", "DNA primase", "DNA repair", "cell division", and "septum" for the 
category "cell cycle". The exponents resulting from this (crude) annotation scheme are also shown in 
table 121 As table |2l shows, there is good quantitative agreement between the exponents that are obtained 
with the different annotation schemes. 
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Table 2: Estimated 99% posterior probability intervals for the scaling exponents obtained with three 
different annotation schemes. 



Annotation 


Category 


Exponent 


GO 

COG 
NCBI 


Protein biosynthesis 
Translation, ribosomal structure and biogenesis 
Protein biosynthesis 


0.11-0.15 

0.21-0.37 
0.09-0.15 


GO 
COG 


Signal transduction 
Signal transduction mechanisms 


1.55-1.9 
1.66-2.14 


GO 
GO 
COG 

NCBI 


Protein metabolism and modification 
Protein degradation 
Posttranslational modification, protein turnover, chaperones 
Protein degradation 


0.68-0.8 
0.89-1.06 
0.88-1.15 
0.68-0.91 


GO 
GO 
COG 
NCBI 


Cell cycle 
DNA repair 
Replication, recombination and repair 
Cell cycle 


0.39-0.54 
0.52-1.14 
0.66-0.83 
0.45-0.64 


GO 
COG 
NCBI 


Ion transport 
Inorganic ion transport and metabolism 
Ion transport 


1.15-1.7 
1.19-1.47 
1.12-1.88 


GO 

COG 
NCBI 


Regulation of transcription 
Transcription regulation 
Transcription 


1.74-2 

1.69-2.36 
1.9-2.42 


GO 
NCBI 


Kinase 
Kinase 


0.96-1.16 
0.8-1.03 


GO 
NCBI 


Transport 
Transport 


1.08-1.32 
1.16-1.5 
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